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Atomistic molecular simulations are a powerful way to make quantitative predictions, but the accuracy of 
these predictions depends entirely on the quality of the forcefield employed. While experimental measure¬ 
ments of fundamental physical properties offer a straightforward approach for evaluating forcefield quality, the 
bulk of this information has been tied up in formats that are not machine-readable. Compiling benchmark 
datasets of physical properties from non-machine-readable sources require substantial human effort and is 
prone to accumulation of human errors, hindering the development of reproducible benchmarks of forcefield 
accuracy. Here, we examine the feasibility of benchmarking atomistic forcefields against the NIST ThermoML 
data archive of physicochemical measurements, which aggregates thousands of experimental measurements in 
a portable, machine-readable, self-annotating format. As a proof of concept, we present a detailed benchmark 
of the generalized Amber small molecule forcefield (GAFF) using the AMl-BCC charge model against mea¬ 
surements (specifically bulk liquid densities and static dielectric constants at ambient pressure) automatically 
extracted from the archive, and discuss the extent of available data. The results of this benchmark highlight 
a general problem with fixed-charge forcefields in the representation low dielectric environments such as those 
seen in binding cavities or biological membranes. 
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I. INTRODUCTION 

Recent advances in hardware and software for molec¬ 
ular dynamics simulation now permit routine access to 
atomistic simulations at the 100 ns timescale and be¬ 
yond^. Leveraging these advances in combination with 
consumer GPU clusters, distributed computing, or cus¬ 
tom hardware has brought microsecond and millisecond 
simulation timescales within reach of many laboratories. 
These dramatic advances in sampling, however, have re¬ 
vealed deficiencies in forcefields as a critical barrier to 
enabling truly predictive simulations of physical proper¬ 
ties of biomolecular systems. 

Protein and water forcefields have been the subject 
of numerous benchmarks^^"^ and enhancements'’’^^, with 
key outcomes including the ability to fold fast-folding 
proteins®^^*^, improved fidelity of water thermodynamic 
properties'^, and improved prediction of NMR observ¬ 
ables. Although small molecule forcefields have also been 
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the subject of benchmarks^^^^^ and improvements^®, 
such work has typically focused on small perturbations to 
specific functional groups. For example, a recent study 
found that modified hydroxyl nonbonded parameters led 
to improved prediction of static dielectric constants and 
hydration free energies^®. There are also outstanding 
questions of generalizability of these targeted perturba¬ 
tions; it is uncertain whether changes to the parameters 
for a specific chemical moiety will be compatible with 
seemingly unrelated improvements to other groups. Ad¬ 
dressing these questions requires establishing community 
agreement upon shared benchmarks that can be easily 
replicated among laboratories to test proposed forcefield 
enhancements and expanded as the body of experimental 
data grows. 

A key barrier to establishing reproducible and exten¬ 
sible forcefield accuracy benchmarks is that many exper¬ 
imental datasets are heterogeneous, paywalled, and un¬ 
available in machine-readable formats (although notable 
counterexamples exist, e.g. the RGSB^®, FreeSolv^^, and 
the BMRB^®). While this inconvenience is relatively mi¬ 
nor for benchmarking forcefield accuracy for a single tar¬ 
get system (e.g. water), it becomes prohibitive for stud¬ 
ies spanning the large relevant chemical spaces, such as 
forcefields intended to describe a large variety of druglike 
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small organic molecules. 

In addition to inconvenience, the number and kind of 
human-induced errors that can corrupt hand-compiled 
benchmarks are legion. A United States Geological Sur¬ 
vey (USGS) case study examining the reporting and use 
of literature values of the aqueous solubility (Sw) and 
octanol-water partition coefficients (Kow) for DDT and 
its persistent metabolite DDE provides incredible insight 
into a variety of common errors^®. Secondary sources are 
often cited as primary sources—a phenomenon that oc¬ 
curred up to five levels deep in the case of DDT/DDE; 
citations for data are often incorrect, misattributed to 
unrelated publications, or omitted altogether; numerical 
data can be mistranscribed, transposed, or incorrectly 
converted among unit systems^®. This occurs to such a 
degree that the authors note “strings of erroneous data 
compose as much as 41-73 percent of the total data“^®. 
Given the incredible importance of these properties for 
human health and the environment, the quality of physic¬ 
ochemical datasets of far lesser importance is highly sus¬ 
pect. 

To ameliorate problems of data archival, the NIST 
Thermodynamics Research Genter (TRG) has de¬ 
veloped an lUPAG standard XML-based format— 
ThermoML^'^^^^ —for storing physicochemical measure¬ 
ments, uncertainties, and metadata. Manuscripts con¬ 
taining new experimental measurements submitted to 
several journals (J. Ghem. Eng. Data, J. Ghem. Therm., 
Fluid Phase Equil., Therm. Acta, and Int. J. Therm.) 
are guided through a data archival process that involves 
sanity checks, conversion to a standard machine-readable 
format, and archival at the TRG (http;//trc. nist. 
gov/ThermoML.html). 

Here, we examine the ThermoML archive as a potential 
source for a reproducible, extensible accuracy benchmark 
of biomolecular forcefields. In particular, we concentrate 
on two important physical property measurements eas¬ 
ily computable in many simulation codes—neat liquid 
density and static dielectric constant measurements— 
with the goal of developing a standard benchmark for 
validating these properties in fixed-charge forcefields of 
drug-like molecules and biopolymer residue analogues. 
These two properties provide sensitive tests of forcefield 
accuracy that are nonetheless straightforward to calcu¬ 
late. Using these data, we evaluate the generalized Am¬ 
ber small molecule forcefield (GAFF)^^’^^ with the AMl- 
BGG charge modeU®’^® and identify systematic biases to 
aid further forcefield refinement. 


II. METHODS 

A. ThermoML Archive retrieval and processing 

A tarball archive snapshot of the ThermoML Archive 
was obtained from the the NIST TRG on 8 Apr. 2015. To 
explore the content of this archive, we created a Python 
(version 2.7.9) tool (ThermoPyL: https;//github. com/ 


choderalab/ThermoPyL) that formats the XML content 
into a spreadsheet-like format accessible via the Pan¬ 
das (version 0.15.2) library. First, we obtained the 
XML schema (http;//media, iupac . org/namespaces/ 
ThermoML/ThermoML. xsd) defining the layout of the 
data. This schema was converted into a Python ob¬ 
ject via PyXB 1.2.4 (http://pyxb.sourceforge.net/). 
Finally, this schema was used to extract the data into 
Pandas^^ dataframes, and the successive data filters de¬ 
scribed in Section III A were applied. 


B. Simulation 

To enable automated accuracy benchmarking of 
physicochemical properties of neat liquids such as mass 
density and dielectric constant, we developed a semi- 
automated pipeline for preparing simulations, running 
them on a standard computer cluster using a portable 
simulation package, and analyzing the resulting data. All 
code for this procedure is available at https: //github. 
com/choderalab/LiquidBenchmcirk. Below, we describe 
the operation of the various stages of this pipeline and 
their application to the benchmark reported here. 


1. Preparation 

Ghemical names were parsed from the ThermoML ex¬ 
tract and converted to both GAS and smiles strings us¬ 
ing cirpy^®. Smiles strings were converted into molecu¬ 
lar structures using the OpenEye Python Toolkit version 
2015-2-3^®, as wrapped in openmoltools. 

Simulation boxes containing 1000 molecules were con¬ 
structed using PackMol version 14-225®*’’®^ wrapped in 
the Python automation library openmoltools. In order 
to ensure stable automated equilibration, PackMol box 
volumes were chosen to accommodate twice volume of 
the enclosed atoms, with atomic radii estimated as 1.06 
A and 1.53 A for hydrogens and nonhydrogens, respec¬ 
tively. 

For this illustrative benchmark, we utilized the gener¬ 
alized Amber small molecule forcefield (GAFF)^®’^^ with 
the AMl-BGG charge modeU^’^®, which we shall refer to 
as the GAFF/AMl-BGG forcefield. 

Ganonical AM1-BGG25.26,32 charges were generated 
with the OpenEye Python Toolkit version 2015-2- 
3^®, using the oequacpac. OEAssignPartialChcurges 
module with the 0ECharges_AMlBCCSym option, which 
utilizes a conformational expansion procedure (using 
oeomega. OEOmega®®) prior to charge fitting to minimize 
artifacts from intramolecular contacts. The OEOmega se¬ 
lected conformer was then processed using antechamber 
(with parmchk2) and tleap in Amber Tools 14®"^ to pro¬ 
duce Amber-format prmtop and inpcrd files, which were 
then read into OpenMM to perform molecular simula¬ 
tions using the simtk.openmm.app module. 


3 


The simulations reported here used libraries open- 
moltools 0.6.4^^, OpenMM 6.3^®, and MDTraj 1.3^^. Ex¬ 
act commands to install various dependencies can be 
found in Appendix A 1. 


2. Equilibration and production 

Simulation boxes were first minimized using the L- 
BFGS algorithm^® and equilibrated for 10^ steps with 
an equilibration timestep of 0.4 fs and a collision rate 
of 5 ps“^. Production simulations were performed with 
OpenMM 6.3®® using a Langevin Leapfrog integrator®® 
(with collision rate 1 ps“®) and a 1 fs timestep, as we 
found that timesteps of 2 fs timestep or greater led to a 
significant timestep dependence in computed equilibrium 
densities (Fig. 4). 

Equilibration and production simulations utilized a 
Monte Carlo barostat with a control pressure of 1 atm 
(101.325 kPa), utilizing molecular scaling and automated 
step size adjustment during equilibration, with volume 
moves attempted every 25 steps. The particle mesh 
Ewald (PME) method with conducting boundary con¬ 
ditions^® was used with a long-range cutoff of 0.95 nm 
and a long-range isotropic dispersion correction. PME 
grid and spline parameters were automatically selected 
using the default settings in OpenMM 6.3 for the CUBA 
platform®®. 

Automatic termination criteria. Production sim¬ 
ulations were continued until automatic analysis showed 
standard errors in densities were less than 2 x 10“^ g 
/ cm®. Automatic analysis of the production simula¬ 
tion data was run every 1 ns of simulation time, and 
utilized the detectEquilibration method in the time- 
series module of pymbar 2.1"*^® to automatically discard 
the initial portion of the production simulation contain¬ 
ing strong far-from-equilibrium behavior by maximizing 
the number of effectively uncorrelated samples in the re¬ 
mainder of the production simulation as determined by 
autocorrelation analysis using the fast adaptive statis¬ 
tical inefficiency computation method as implemented 
in the timeseries.computeStatisticallnefficiency 
method of pymbar 2.1 (where the algorithm is described 
in^^). This approach is essentially the same as the fixed- 
width procedure described by eq. 7.12 of ref.^®, with 
n* equal to 4000 and the sequential testing correction 
(n“® term) ignored due to the large value of n. Statis¬ 
tical errors were computed by S^p « var{p)/Neff, where 
var(p) is the sample variance of the density and Ngg is 
the number of effectively uncorrelated samples. With this 
protocol, we found starting trajectory lengths of 12000 
(8000,16000) frames (250 fs each), discarded regions of 
28 (0,460), and statistical inefficiencies of 20 (15, 28); re¬ 
ported numbers indicate (median, (25%, 75%)). 

Instantaneous densities were stored every 250 fs, while 
trajectory snapshots were stored every 5 ps. 


C. Timings 

The wall time required for a given simulation depends 
on the number of atoms (3,000 - 29,000), the GPU used 
(GTX 680 or GTX Titan), and the time required for au¬ 
tomated termination. For butyl acrylate (21,000 atoms) 
on a GTX Titan, the wall-clock performance is approxi¬ 
mately 80 ns / day. Using 80 ns / day with approximately 
3 ns of production simulation corresponds to 1 hour for 
the production segment of the simulation and 3 hours for 
the fixed equilibration portion of 10®" steps. 

1. Data analysis and statistical error estimation 

Trajectory analysis was performed using 
OpenMM 6.3®® and MDTraj 1.3®®". 

Mass density. Mass density p was computed via the 
relation. 



where M is the total mass of all particles in the system 
and V is the instantaneous volume of the simulation box. 

Static dielectric constants. Static dielectric con¬ 
stants were calculated using the dipole fluctuation ap¬ 
proach appropriate for PME with conducting (“tin-foil”) 
boundary conditions®®’^"®, with the total system box 
dipole p, computed from trajectory snapshots using MD¬ 
Traj 1.3®®". 

_ 1 , (m • m) - (m) ■ (m) 

" “ + ^ 3 (U) ^ ’ 

where /? = l/ksT is the inverse temperature. 

Computation of expectations. Expectations were 
estimated by computing sample means over the pro¬ 
duction simulation after discarding the initial far-from- 
equilibrium portion to equilibration (as described in Au¬ 
tomatic termination criteria above). 

Statistical uncertainties. For density uncertainties, 
the Markov chain standard error (MGSE) was estimated 
as , , where a is the density standard deviation of 

the simulation not discarded to equilibration, W// = ~ 
is the effective sample size, and g is the statistical ineffi¬ 
ciency as estimated from the density time series. For di¬ 
electric uncertainties, the portion of the production sim¬ 
ulation not discarded to equilibration was used as input 
to a circular block bootstrapping procedure^® with block 
sizes automatically selected to maximize the error"®®. 

2. Code availability 

Data analysis, all intermediate data (except configu¬ 
rational trajectories, due to their large size), and fig¬ 
ure creation code for this work is available at https; 
//github.com/choderalab/LiquidBenchmark. 
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III. RESULTS 

A. Extracting neat liquid measurements from the NIST TRC 
ThermoML Archive 

As described in Section II A, we retrieved a copy of 
the ThermoML Archive and performed a number of se¬ 
quential filtering steps to produce an ThermoML extract 
relevant for benchmarking forcefields describing small or¬ 
ganic molecules. As our aim is to explore neat liquid 
data with functional groups relevant to biopolymers and 
drug-like molecules, we applied the following ordered fil¬ 
ters, starting with all data containing density or static 
dielectric constants: 

1. The measured sample contains only a single com¬ 
ponent (e.g. no binary mixtures) 

2. The molecule contains only druglike elements (de¬ 
fined here as H, N, C, O, S, P, F, Cl, Br) 

3. The molecule has <10 non-hydrogen atoms 

4. The measurement was performed in a biophysically 
relevant temperature range (270 < T [K] < 330) 

5. The measurement was performed at ambient pres¬ 
sure (100 < P [kPa] < 102) 

6. Only measurements in liquid phase were retained 

7. The temperature and pressure were rounded to 
nearby values (as described below), averaging all 
measurements within each group of like conditions 


8. Only conditions (molecule, temperature, pressure) 
for which both density and dielectric constants were 
available were retained 


Filter step 

Number of measurements remaining 

Mass density 

Static dielectric 

1. 

Single Component 

136212 

1651 

2. 

Druglike Elements 

125953 

1651 

3. 

Heavy Atoms 

71595 

1569 

4. 

Temperature 

38821 

964 

5. 

Pressure 

14103 

461 

6. 

Liquid state 

14033 

461 

7. 

Aggregate T, P 

3592 

432 

8. 

Density-pDielectric 

246 

246 


TABLE I: Successive filtration of the ThermoML 
Archive. A set of successive filters were applied to all 
measurements in the ThermoML Archive that 
contained either mass density or static dielectric 
constant measurements. Each column reports the 
number of measurements remaining after successive 
application of the corresponding filtration step. The 246 
final measurements correspond to 45 unique molecules 
measured at several temperature conditions. 


Functional Group 

Occurrences 

1,2-aminoalcohol 

4 

1,2-diol 

3 

alkene 

3 

aromatic compound 

1 

carbonic acid diester 

2 

carboxylic acid ester 

4 

dialkyl ether 

7 

heterocyclic compound 

3 

ketone 

3 

lactone 

1 

primary alcohol 

19 

primary aliphatic amine (alkylamine) 

2 

primary amine 

2 

secondary alcohol 

4 

secondary aliphatic amine (dialkylamine) 

2 

secondary aliphatic/aromatic amine (alkylarylamine) 

1 

secondary amine 

3 

sulfone 

1 

sulfoxide 

1 

tertiary aliphatic amine (trialkylamine) 

3 

tertiary amine 

3 


The temperature and pressure rounding step was moti¬ 
vated by common data reporting variations; for example, 
an experiment performed at the freezing temperature of 
water and ambient pressure might be entered as either 
101.325 kPa or 100 kPa, with a temperature of either 
273 K or 273.15 K. Therefore all pressures within the 
range [kPa] (100 < P < 102) were rounded to exactly 1 
atm (101.325 kPa). Temperatures were rounded to one 
decimal place in K. 

The application of these filters (Table I) leaves 246 
conditions—where a condition here indicates a (molecule, 
temperature, pressure) tuple—for which both density 
and dielectric data are available. The functional groups 
present in the resulting dataset are summarized in Ta¬ 
ble II; see Section IIA for further description of the soft¬ 
ware pipeline used. 


TABLE II: Functional groups present in filtered 
dataset. The filtered ThermoML dataset contained 246 
distinct (molecule, temperature, pressure) conditions, 
spanning 45 unique compounds. The functional groups 
represented in these compounds (as identified by the 
program checkmol vO.5^^) is summarized here. 

B. Benchmarking GAFF/AMl-BCC against the ThermoML 
Archive 

1. Mass density 

Mass densities of bulk liquids have been widely used for 
parameterizing and testing forcefields, particularly the 
Lennard-Jones parameters representing dispersive and 
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repulsive interactions^®’^®. We therefore used the present 
ThermoML extract as a benchmark of the GAFF/AMl- 
BCC forcefield (Fig. 1). 

Overall accuracy. Overall, the densities show rea¬ 
sonable accuracy, with a root-mean square (RMS) rela¬ 
tive error over all measurements of (3.0±0.1)%, especially 
encouraging given that this forcefield was not designed 
with the intention of modeling bulk liquid properties of 
organic molecules®®’®^. This is reasonably consistent with 
previous studies reporting agreement of 4% on a different 
benchmark set^^. 

Temperature dependence. For a given compound, 
the signs of the errors typically do not change at different 
temperatures (Fig. 1, Fig. 7). Furthermore, the magni¬ 
tudes of the error also remain largely constant (vertical 
lines in Fig. 1 B), although several exceptions do occur. 
It is possible that these systematic density offsets indi¬ 
cate correctable biases in forcefield parameters. 

Outliers. The largest density errors occur for a num¬ 
ber of oxygen-containing compounds: 1,4-dioxane; 2,5,8- 
trioxanonane; 2-aminoethanol; dimethyl carbonate; for- 
mamide; and water (Fig. 7). The absolute error on 
these poor predictions is on the order of 0.05 g/cm®, 
which is substantially higher than the measurement error 
(< 0.008 g/cm®; see Fig. 5). 

We note that our benchmark includes a GAFF/AMl- 
BCC model for water due to our desire to automate 
benchmarks against a forcefield capable of modeling a 
large variety of small molecular liquids. Water—an in¬ 
credibly important solvent in biomolecular systems—is 
generally treated with a special-purpose model (such as 
TIP3P^® or TIP4P-Ew^^) parameterized to fit a large 
quantity of thermophysical data. As expected, the 
GAFF/AMl-BGG model performs poorly in reproducing 
liquid densities for this very special solvent. We conclude 
that it remains highly advisable that the field continue 
to use specialized water models when possible. 


2. Static dielectric constant 



Predicted (GAFF) 

(a) 


Density [g /cm*3] (relative rms: 0.030 ± 0.001) 







0.6 

- 0.10 


0.00 0.05 

Predicted - Experiment 


(b) 


Overall accuracy. As a measure of the dielectric re¬ 
sponse, the static dielectric constant of neat liquids pro¬ 
vides a critical benchmark of the accuracy electrostatic 
treatment in forcefield models. Discussing the accuracy 
in terms the ability of GAFF/AMl-BGG to reproduce 
the static dielectric constant e is not necessarily meaning¬ 
ful because of the way that the solvent dielectric e enters 
into the Goulomb potential between two point charges 
separated by a distance r, 

TJ( ^ Q1Q2 I /Q\ 

U [r) = - (X -. (3) 

er e 

It is evident that 1/e is a much more meaningful quantity 
to compare than e directly, as a 5% error in 1/e will cause 
a 5% error in the Goulomb potential between two point 
charges (assuming a uniform dielectric), while a 5% error 
in e will have a much more complex e-dependent effect 


FIG. 1: Comparison of liquid densities between 
experiment and simulation, (a). Liquid density 
measurements extracted from ThermoML are compared 
against densities predicted using the GAFF / 
AMl-BGG small molecule fixed-charge forcefield. Golor 
groupings represent identical chemical species, although 
the color map repeats itself due to the large (45) 
number of unique compounds. Plots of density versus 
temperature grouped by chemical species are available 
in Fig. 7. Simulation error bars represent one standard 
error of the mean, with the number of effective 
(uncorrelated) samples estimated using pymbar. 
Experimental error bars indicate the standard deviation 
between independently reported measurements, when 
available, or author-reported standard deviations in 
ThermoML entries; for some measurements, neither 
uncertainty estimate is available. See Fig. 5 for further 
discussion of error, (b). The same plot, but with the 
residual (predicted minus experiment) on the x axis. 
Note that the error bars are all smaller than the 
symbols. 
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on the Coulomb potential. We therefore compare simu¬ 
lations against measurements in our ThermoML extract 
on the 1/e scale in Fig. 2. 

GAFF/AMl-BCC systematically underesti¬ 
mates the dielectric constants of nonpolar liquids. 

Overall, we find the dielectric constants to be qualita¬ 
tively reasonable, but with clear deviations from exper¬ 
iment particularly for nonpolar liquids. This is not sur¬ 
prising given the complete neglect of electronic polariza¬ 
tion which will be the dominant contribution for such 
liquids. In particular, GAFF/AMl-BCC systematically 
underestimates the dielectric constants for nonpolar liq¬ 
uids, with the predictions of e « 1.0 being substantially 
smaller than the measured e « 2. Because this devia¬ 
tion likely stems from the lack of an explicit treatment 
of electronic polarization, we used a simple empirical po¬ 
larization model that computes the molecular electronic 
polarizability a as a sum of elemental atomic polarizabil¬ 
ity contributions®^. 

From the computed molecular electronic polarizability 
a, an additive correction to the simulation-derived static 
dielectric constant accounting for the missing electronic 
polarizability can be computed^^ 

Ae = 47rA-^ (4) 

A similar polarization correction was used in the develop¬ 
ment of the TIP4P-Ew water model, where it had a minor 
effect because almost all the high static dielectric con¬ 
stant for water comes from the configurational response 
of its strong dipole. However, the missing polarizability 
is a dominant contribution to the static dielectric con¬ 
stant of nonpolar organic molecules; in the case of water, 
the empirical atomic polarizability model predicts a di¬ 
electric correction of 0.52, while 0.79 was used for the 
TIP4P-Ew model. Averaging all liquids in the present 
work leads to polarizability corrections to the static di¬ 
electric of 0.74 ±0.08. Taking the dataset as a whole, we 
find that the relative error in uncorrected dielectric is on 
the order of —0.34 ± 0.02, as compared to —0.25 ± 0.02 
for the corrected dielectric. 


IV. DISCUSSION 
A. Mass densities 

Our simulations have indicated the presence of sys¬ 
tematic density biases with magnitudes larger than the 
measurement error. Correcting these errors may be a 
low-hanging fruit for future forcefield refinements. As an 
example of the feasibility of improved accuracy in densi¬ 
ties, a recent three-point water model was able to recapit¬ 
ulate water density with errors of less than 0.005 g / cm® 
over temperature range [280 K, 320 K]®^. This improved 
accuracy in density prediction was obtained alongside ac¬ 
curate predictions of other experimental observables, in¬ 
cluding static dielectric constant. We suspect that such 



Predicted (GAFF) 

FIG. 2: Measured (ThermoML) versus predicted 
(GAFF / AMl-BCC) inverse static dielectrics 
(a). Simulation error bars represent one standard error 
of the mean. Experimental error bars indicate the 
larger of standard deviation between independently 
reported measurements and the authors reported 
standard deviations; for some measurements, neither 
uncertainty estimate is available. See Fig. 5 for further 
discussion of error. See Section IIIB2 for explanation of 
why inverse dielectric constant (rather than dielectric 
constant) is plotted. For nonpolar liquids, it is clear 
that the forcefield predicts electrostatic interactions 
that are substantially biased by missing polarizability. 
Plots of dielectric constant versus temperature grouped 
by chemical species are available in Fig. 8. 

accuracy might be obtainable for GAFF-like forcefields 
across some portion of chemical space. A key challenge 
for the field is to demarcate the fundamental limit of 
fixed-charge forcefields for predicting orthogonal classes 
of experimental observables. For example, is it possible 
to achieve a relative density error of 10“^ without sac¬ 
rificing accuracy of other properties such as enthalpies? 
In our opinion, the best way to answer such questions is 
to systematically build forcefields with the goal of pre¬ 
dicting various properties to within their known experi¬ 
mental uncertainties, similar to what has been done for 
water^^’®^. 


B. Dielectric constants in forcefield parameterization 

A key feature of the static dielectric constant for a liq¬ 
uid is that, for forcefield purposes, it consists of two very 
different components, distinguished by the dependence 
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on the fixed charges of the forcefield and dynamic mo¬ 
tion of the molecule. One component, the high-frequency 
dielectric constant, arises from the almost-instantaneous 
electronic polarization in response to the external elec¬ 
tric field: this contributes a small component, generally 
around e = 2 , which can be dominant for non-polar liq¬ 
uids but is completely neglected by the non-polarizable 
forcefields in common use for biomolecular simulations. 
The other component arises from the dynamical response 
of the molecule, through nuclear motion, to allow its var¬ 
ious molecular multipoles to respond to the external elec¬ 
tric field: for polar liquids such as water, this contributes 
the majority of the dielectric constant. Thus for polar 
liquids, we expect the parameterized atomic charges to 
play a major role in the static dielectric. 

Recent forcefield development has seen a resurgence 
of papers fitting dielectric constants during forcefield 
parameterization^®’®^. However, a number of authors 
have pointed out potential challenges in constructing self- 
consistent fixed-charge forcefields®^’®®. 

Interestingly, recent work by Dill and coworkers®^ ob¬ 
served that, for CCI 4 , reasonable choices of point charges 
are incapable of recapitulating the observed dielectric of 
e = 2 . 2 , instead producing dielectric constants in the 
range of 1.0 < e < 1.05. This behavior is quite general: 
fixed point charge forcefields will predict e « 1 for many 
nonpolar or symmetric molecules, but the measured di¬ 
electric constants are instead e « 2 (Fig. 3). While this 
behavior is well-known and results from missing physics 
of polarizability, we suspect it may have several profound 
consequences, which we discuss below. 

Suppose, for example, that one attempts to fit force- 
field parameters to match the static dielectric constants 
of CCI 4 , CHCI 3 , CH 2 CI 2 , and CH 3 CI. In moving from 
the tetrahedrally-symmetric CCI 4 to the asymmetric 
CHCI 3 , it suddenly becomes possible to achieve the ob¬ 
served dielectric constant of 4.8 by an appropriate choice 
of point charges. However, the model for CHCI 3 uses 
fixed point charges to account for both the permanent 
dipole moment and the electronic polarizability, whereas 
the CCI 4 model contains no treatment of polarizabil¬ 
ity. We hypothesize that this inconsistency in param¬ 
eterization may lead to strange mismatches, where sym¬ 
metric molecules (e.g. benzene and CCI 4 ) have qualita¬ 
tively different properties than closely related asymmet¬ 
ric molecules (e.g. toluene and CHCI 3 ). 

How important is this effect? We expect it to be im¬ 
portant wherever we encounter the transfer of a polar 
molecule (such as a peptide, native ligand, or a pharma¬ 
ceutical small molecule) from a polar environment (such 
as the cytosol, interstitial fluid, or blood) into a non-polar 
environment (such as a biological membrane or non-polar 
binding site of an enzyme or receptor). Thus we ex¬ 
pect this to be implicated in biological processes ranging 
from ligand binding to absorption and distribution within 
the body. To understand this conceptually, consider the 
transfer of a polar small-molecule transfer from the non¬ 
polar interior of a lipid bilayer to the aqueous and hence 
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FIG. 3: Typical experimental static dielectric 
constants of some nonpolar compounds, (a). 
Measured static dielectric constants of various nonpolar 
or symmetric molecules®^’®®. Fixed-charge forcefields 
give e « 1 for each species; for example, we calculated 
e = 1.0030 ± 0.0002 for octane, (b). A congeneric series 
of chloro-substituted methanes have static dielectric 
constants between 2 and 13. Reported dielectric 
constants are at near-ambient temperatures. 


very polar cytosol. As a possible real-world example, 
we imagine that the missing atomic polarizability could 
be important in accurate transfer free energies involving 
low-dielectric solvents, such as the small-molecule trans¬ 
fer free energy from octanol or cyclohexane to water. The 
Onsager model for solvation of a dipole fjL of radius a gives 
us a way to estimate the magnitude of error introduced 
by making an error of Ae in the static dielectric constant 
of a solvent. The free energy of dipole solvation is given 
by this model as 


AG = - 


e - 1 
~^2e + l 


( 5 ) 


such that, for an error of Ae departing from the true 
static dielectric constant e, we find the error in solvation 
is 


AAG = - 


^ r (e -k Ae) - 1 
a® _2(e-I-Ae)-I-1 


e- 1 
2e + l 


( 6 ) 


For example, the solvation of water (a = 1.93 A, 
/i = 2.2 D) in a low dielectric medium such as tetra- 
chloromethane or benzene (e ~ 2 . 2 , but Ae = — 1 . 2 ) gives 
an error of AAG ~ — 8 kJ/mol (-2 kcal/mol). 

Implications for transfer free energies. As an¬ 
other example, consider the transfer of small druglike 















molecules from a nonpolar solvent (such as cyclohexane) 
to water, a property often measured to indicate the ex¬ 
pected degree of lipophilicity of a compound. To estimate 
the magnitude of error expected, for each molecule in the 
latest (Feb. 20) FreeSolv database^^’®®, we estimated the 
expected error in computed transfer free energies should 
GAFF/AMl-BCC be used to model the nonpolar solvent 
cyclohexane using the Onsager model (Eq. 6). We used 
took the cavity radius a to be the half the maximum 
interatomic distance and calculated p. = QiVi using 
the provided mol2 coordinates and AMl-BCC charges. 
This calculation predicts a mean error of (—3.8 ± 0.3) kJ 
/ mol [(—0.91 ± 0.07) kcal / mol] for the 643 molecules 
(where the standard error is computed from bootstrap¬ 
ping over FreeSolv compound measurements), suggesting 
that the missing atomic polarizabilty unrepresentable by 
fixed point charge forcefields could contribute substan¬ 
tially to errors in predicted transfer and solvation prop¬ 
erties of druglike molecules. In other words, the use of a 
fixed-charge physics may lead to errors of 3.8 kJ / mol 
in cyclohexane transfer free energies. We conjecture that 
this missing physics will be important in the upcoming 
(2015) SAMPL challenge®^, which will examine transfer 
free energies in several low dielectric media. 

Utility in parameterization. Given their ease of 
measurement and direct connection to long-range electro¬ 
static interactions, static dielectric constants have high 
potential utility as primary data for forcefield param¬ 
eterization efforts. Although this will require the use 
of forcefields with explicit treatment of atomic polariz¬ 
ability, the inconsistency of fixed-charge models in low- 
dielectric media is sufficiently alarming to motivate fur¬ 
ther study of polarizable forcefields. In particular, con¬ 
tinuum methods®®^®®, point dipole methods®^’®^, and 
Drude methods®®’®^ have been maturing rapidly. Finding 
the optimal balance of accuracy and performance remains 
an open question; however, the use of experimentally- 
parameterized direct polarization methods®® may provide 
polarizability physics at a cost not much greater than 
fixed charge forcefields. 


C. ThermoML as a data source 

The present work has focused on the neat liquid den¬ 
sity and dielectric measurements present in the Ther¬ 
moML Archive^®’^®’®® as a target for molecular dynam¬ 
ics forcefield validation. While liquid mass densities and 
static dielectric constants have already been widely used 
in forcefield work, several aspects of ThermoML make it a 
unique resource for the forcefield community. First, the 
aggregation, support, and dissemination of ThermoML 
datasets through the ThermoML Archive is supported by 
NIST, whose mission makes these tasks a long-term pri¬ 
ority. Second, the ThermoML Archive is actively grow¬ 
ing, through partnerships with several journals, and new 
experimental measurements published in these journals 
are critically examined by the TRC and included in the 


archive. Finally, the files in the ThermoML Archive 
are portable and machine readable via a formal XML 
schema, allowing facile access to hundreds of thousands 
of measurements. Numerous additional physical prop¬ 
erties contained in ThermoML—including activity coef¬ 
ficients, diffusion constants, boiling point temperatures, 
critical pressures and densities, coefficients of expansion, 
speed of sound measurements, viscosities, excess molar 
enthalpies, heat capacities, and volumes—for neat phases 
and mixtures represent a rich dataset of high utility for 
forcefield validation and parameterization. 


V. CONCLUSIONS 

High quality, machine-readable datasets of physico¬ 
chemical measurements will be required for the construc¬ 
tion of next-generation small molecule forcefields. Here 
we have discussed the NIST/TRC ThermoML archive as 
a growing source of physicochemical measurements that 
may be useful for the forcefield community. From the 
NIST/TRC ThermoML archive, we selected a dataset of 
246 ambient, neat liquid systems for which both den¬ 
sities and static dielectric constants are available. Us¬ 
ing this dataset, we benchmarked GAFF/AMl-BCC, one 
of the most popular small molecule forcefields. We find 
systematic biases in densities and particularly static di¬ 
electric constants. Element-based empirical polarizabilty 
models are able to account for much of the systematic 
differences between GAFF/AMl-BCC and experiment. 
Non-polarizable forcefields may show unacceptable bi¬ 
ases when predicting transfer and binding properties of 
non-polar environments such as binding cavities or mem¬ 
branes. 
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VII. DISCLAIMERS 

This contribution of the National Institute of Stan¬ 
dards and Technology (NIST) is not subject to copyright 
in the United States. Products or companies named here 
are cited only in the interest of complete technical de¬ 
scription, and neither constitute nor imply endorsement 
by NIST or by the U.S. government. Other products may 
be found to serve as well. 
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Appendix A: Appendices 

• Figure: Timestep-dependence of density 

• Figure: Error analysis (density) for ThermoML 
dataset 

• Figure: Error analysis (static dielectric constant) 
for ThermoML dataset 

• Figure: Temperature Dependence: Density 

• Figure: Temperature Dependence: Static Dielec¬ 
tric Constant 

• Commands to install dependencies 
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Timestep dependence: Density 
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FIG. 4: Dependence of computed density on 
simulation timestep. To probe the systematic error 
from finite time-step integration, we examined the 
timestep dependence of butyl acrylate density, (a). The 
density is shown for several choices of timestep. (b). 
The relative error, as compared to the reference value, 
is shown for several choices of timestep. Error bars 
represent standard errors of the mean, with the number 
of effective samples estimated using pymbar’s statistical 
inefficiency routine^^. The reference value is estimated 
by linear extrapolation to 0 fs using the 0.5 fs and 1.0 fs 
data points; the linear extrapolation is shown as black 
lines. We find a 2 fs timestep leads to systematic biases 
in the density on the order of 0.13%, while 1 fs reduces 

the systematic bias to approximately 0.8%—we 
therefore selected a 1 fs timestep for the present work, 
where we aimed to achieve three digits of accuracy in 
density predictions. 
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Error Estimates: Density fa / cm-^Sl 
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FIG. 5: Assessment of experimental error: 
Density To assess the experimental error in our 
ThermoML extract, we compared three different 
estimates of uncertainty. In the first approach 
(Weighted), we computed the standard deviation of the 
optimally weighted average of the measurements, using 
the uncertainties reported by authors 
{(^Weighted = Efc ■ This Uncertainty estimator 

places the highest weights on measurements with small 
uncertainties and is therefore easily dominated by small 
outliers and uncertainty under-reporting. In the second 
approach (Median), we estimated the median of the 
uncertainties reported by authors; this statistic should 
be robust to small and large outliers of author-reported 
uncertainties. In the third approach (Std), we calculated 
at the standard deviation of independent measurements 
reported in the ThermoML extract, completely avoiding 
the author-reported uncertainties. Plot (a) compares 
the three uncertaiSnty estimates. We see that 
author-reported uncertainties appear to be substantially 
smaller than the scatter between the observed 
measurements. A simple psychological explanation 
might be that because density measurements are more 
routine, the authors simply report the repeatability 
stated by the manufacturer (e.g. 0.0001 g / cm^ for a 
Mettler Toledo DM40®^). However, this hardware limit 
is not achieved due to inconsistencies in sample 
preparation and experimental conditions; see Appendix 
in Ref.^^. Panel (b) shows the same information as (a) 
but as a function of the measurement index, rather than 
as a scatter plot—because not all measurements have 
author-supplied uncertainties, panel (c) contains slightly 
more data points than (a, b). 
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Error Estimates: Relative permittivity at zero frequency 
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Error Estimates: Relative permittivity at zero frequency 
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FIG. 6: Assessment of experimental error: Static 
Dielectric Constant To assess the experimental error 
in our ThermoML extract, we compared three different 
estimates of uncertainty. In the first approach 
(Weighted), we computed the standard deviation of the 
optimally weighted average of the measurements, using 
the uncertainties reported by authors 
{(^Weighted = T^is Uncertainty estimator 

places the highest weights on measurements with small 
uncertainties and is therefore easily dominated by small 
outliers and uncertainty under-reporting. In the second 
approach (Median), we estimated the median of the 
uncertainties reported by authors; this statistic should 
be robust to small and large outliers of author-reported 
uncertainties. In the third approach (Std), we calculated 
at the standard deviation of independent measurements 
reported in the ThermoML extract, completely avoiding 
the author-reported uncertainties. Plot (a) compares 
the three uncertainty estimates. Unlike the case of 
densities, author-reported uncertainties appear to be 
somewhat larger than the scatter between the observed 
measurements. Panel (b) shows the same information as 
(a) but as a function of the measurement index, rather 
than as a scatter plot—because not all measurements 
have author-supplied uncertainties, panel (c) contains 
slightly more data points than (a, b). 
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FIG. 7: Comparison of simulated and experimental densities for all compounds. Measured (blue) and 

simulated (green) densities are shown in units of g / cm^. 
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simulated (green) densities are shown in units of g / cm^. 
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FIG. 8: Comparison of simulated and experimental static dielectric constants for all compounds. 

Measured (blue), simulated (green), and polarizability-corrected simulated (red) static dielectric constants are shown 

for all compounds. 
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FIG. 8: Comparison of simulated and experimental static dielectric constants for all compounds. 

Measured (blue), simulated (green), and polarizability-corrected simulated (red) static dielectric constants are shown 

for all compounds. 
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FIG. 8: Comparison of simulated and experimental static dielectric constants for all compounds. 

Measured (blue), simulated (green), and polarizability-corrected simulated (red) static dielectric constants are shown 

for all compounds. 
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FIG. 8: Comparison of simulated and experimental static dielectric constants for all compounds. 

Measured (blue), simulated (green), and polarizability-corrected simulated (red) static dielectric constants are shown 

for all compounds. 
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FIG. 8: Comparison of simulated and experimental static dielectric constants for all compounds. 

Measured (blue), simulated (green), and polarizability-corrected simulated (red) static dielectric constants are shown 

for all compounds. 
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1. Dependency Installation 

The following shell commands can be used to install the 
necessary prerequisites via the conda package manager 
for Python: 

$ conda config —add channels http://conda.binstar.org/omnia 

$ conda install " openmoltools " " pymbar==2.1" " mdtraj ==1.3" "openmm==6.3" packmol 

% 


Note that this command installs the exact versions 
used in the present study, with the exception of open¬ 
moltools for which only a more recent package is avail¬ 
able. However, for authors interested in extending the 
present work, we suggust using the most up-to-date ver¬ 
sions available instead, which involves replace the equal¬ 
ity symbols == with >=. 
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